An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models

Zadorojniy, Alexander, Wasserkrug, Segev, Farchi, Eitan

arXiv.org Artificial Intelligence

Recently, using Large Language Models (LLMs) to generate optimization models from natural language descriptions has become increasingly popular. However, a major open question is how to validate that the generated models are correct and satisfy the requirements defined in the natural language description. In this work, we propose a novel agent-based method for the automatic validation of optimization models that builds upon and extends methods from software testing to address optimization modeling. This method consists of several agents that initially generate a problem-level testing API, then generate tests utilizing this API, and, lastly, generate mutations specific to the optimization model (mutation testing is a well-known software testing technique for assessing the fault-detection power of a test suite). In this work, we detail this validation framework and show, through experiments, the high quality of validation provided by this agent ensemble in terms of the well-known software testing measure called mutation coverage.
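
The paper does not reproduce its agents' code, but the mutation-testing idea it builds on can be sketched concretely. The toy below (all names, such as solve_knapsack and test_capacity_is_respected, are illustrative assumptions, not taken from the paper) brute-forces a small knapsack model, injects a mutation that drops the capacity constraint, and checks that a problem-level test kills the mutant:

from itertools import combinations

def solve_knapsack(values, weights, capacity, respect_capacity=True):
    # Brute-force optimizer; respect_capacity=False is the injected mutant
    # that silently drops the capacity constraint.
    best, best_value = (), 0
    items = range(len(values))
    for r in range(len(values) + 1):
        for subset in combinations(items, r):
            w = sum(weights[i] for i in subset)
            v = sum(values[i] for i in subset)
            if (w <= capacity or not respect_capacity) and v > best_value:
                best, best_value = subset, v
    return best, best_value

def test_capacity_is_respected(solver):
    # Problem-level test: the chosen items must fit in the knapsack.
    values, weights, capacity = [10, 20, 30], [5, 6, 7], 10
    chosen, _ = solver(values, weights, capacity)
    return sum(weights[i] for i in chosen) <= capacity

original = lambda v, w, c: solve_knapsack(v, w, c, respect_capacity=True)
mutant = lambda v, w, c: solve_knapsack(v, w, c, respect_capacity=False)

assert test_capacity_is_respected(original)      # passes on the correct model
assert not test_capacity_is_respected(mutant)    # the mutant is "killed"

A test suite's mutation coverage is then the fraction of such injected mutants that at least one test detects.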


Nike's Robotic Shoe Gets Humans One Step Closer to Cyborg

WIRED

Project Amplify is Nike's latest attempt to put some spring in your step with help from a powered mechanism that enhances the natural movement of the human ankle and lower leg. If you want to run faster or farther, you have options. You can put in the work, getting up 40 minutes earlier to train, changing your diet, going harder and longer on each of your runs to build up strength. Or, you can strap on one of Nike's new robot shoes and mechanically boost your speed, your stamina, and your overall performance in a flash. Sounds way easier, and probably more fun too.


A Foundation Model for Spatial Proteomics

Shaban, Muhammad, Chang, Yuzhou, Qiu, Huaying, Yeo, Yao Yu, Song, Andrew H., Jaume, Guillaume, Wang, Yuchen, Weishaupt, Luca L., Ding, Tong, Vaidya, Anurag, Lamane, Abdallah, Shao, Daniel, Zidane, Mohammed, Bai, Yunhao, McCallum, Paige, Luo, Shuli, Wu, Wenrui, Wang, Yang, Cramer, Precious, Chan, Chi Ngai, Stephan, Pierre, Schaffenrath, Johanna, Lee, Jia Le, Michel, Hendrik A., Tian, Caiwei, Almagro-Perez, Cristina, Wagner, Sophia J., Sahai, Sharifa, Lu, Ming Y., Chen, Richard J., Zhang, Andrew, Gonzales, Mark Edward M., Makky, Ahmad, Lee, Jia-Ying Joey, Cheng, Hao, Ahmar, Nourhan El, Matar, Sayed, Haist, Maximilian, Phillips, Darci, Tan, Yuqi, Nolan, Garry P., Burack, W. Richard, Estes, Jacob D., Liu, Jonathan T. C., Choueiri, Toni K, Agarwal, Neeraj, Barry, Marc, Rodig, Scott J., Le, Long Phi, Gerber, Georg, Schürch, Christian M., Theis, Fabian J., Kim, Youn H, Yeong, Joe, Signoretti, Sabina, Howitt, Brooke E., Loo, Lit-Hsin, Ma, Qin, Jiang, Sizun, Mahmood, Faisal

arXiv.org Artificial Intelligence

Foundation models have begun to transform image analysis by acting as pretrained generalist backbones that can be adapted to many tasks even when post-training data are limited, yet their impact on spatial proteomics, imaging that maps proteins at single-cell resolution, remains limited. Here, we introduce KRONOS, a foundation model built for spatial proteomics. KRONOS was trained in a self-supervised manner on over 47 million image patches covering 175 protein markers, 16 tissue types, and 8 fluorescence-based imaging platforms. We introduce key architectural adaptations to address the high-dimensional, multi-channel, and heterogeneous nature of multiplex imaging. We demonstrate that KRONOS learns biologically meaningful representations across multiple scales, ranging from cellular and microenvironment to tissue levels, enabling it to address diverse downstream tasks, including cell phenotyping, region classification, and patient stratification. Evaluated across 11 independent cohorts, KRONOS achieves state-of-the-art performance on cell phenotyping, treatment response prediction, and retrieval tasks, and is highly data-efficient. KRONOS also introduces the paradigm of segmentation-free patch-level processing for efficient and scalable spatial proteomics analysis, enabling cross-institutional comparisons and serving as a reverse image search engine for spatial patterns.
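
KRONOS's exact architecture is described in the paper; as a hedged sketch of one way to handle the multi-channel, heterogeneous-panel nature of multiplex imaging, the toy encoder below treats each marker channel as its own token and adds a learned marker embedding, so patches with different marker panels land in a shared representation space (the class name, dimensions, and pooling choice are assumptions for illustration, not KRONOS's actual design):

import torch, torch.nn as nn

class MarkerAwarePatchEncoder(nn.Module):
    # Hypothetical sketch: embed each marker channel independently, add a
    # learned per-marker embedding, and pool, so patches with different
    # marker panels map into one shared representation space.
    def __init__(self, n_known_markers=175, patch=16, dim=256):
        super().__init__()
        self.pixel_proj = nn.Linear(patch * patch, dim)   # one channel -> one token
        self.marker_emb = nn.Embedding(n_known_markers, dim)

    def forward(self, channels, marker_ids):
        # channels: (n_markers, patch, patch); marker_ids: (n_markers,)
        tokens = self.pixel_proj(channels.flatten(1))     # (n_markers, dim)
        tokens = tokens + self.marker_emb(marker_ids)
        return tokens.mean(dim=0)                         # patch-level embedding

enc = MarkerAwarePatchEncoder()
x = torch.randn(12, 16, 16)            # a hypothetical 12-marker patch
ids = torch.arange(12)
print(enc(x, ids).shape)               # torch.Size([256])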


Comparing Human and AI Rater Effects Using the Many-Facet Rasch Model

Jiao, Hong, Song, Dan, Lee, Won-Chan

arXiv.org Artificial Intelligence

Large language models (LLMs) have been widely explored for automated scoring in low-stakes assessment to facilitate learning and instruction. Empirical evidence about which LLMs produce the most reliable scores and induce the fewest rater effects needs to be collected before LLMs are used for automated scoring in practice. This study compared ten LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. The accuracy of the holistic and analytic scores from LLMs relative to human raters was evaluated in terms of Quadratic Weighted Kappa. Intra-rater consistency across prompts was compared in terms of Cronbach's alpha. Rater effects of the LLMs were evaluated and compared with human raters using the Many-Facet Rasch model. The results generally supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, which showed high scoring accuracy, better rater reliability, and fewer rater effects.
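
Both agreement measures used in the study are standard and easy to compute; a minimal sketch with scikit-learn and NumPy (the score values below are fabricated placeholders, not data from the study) is:

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores on a 0-4 holistic rubric for 8 essays.
human = np.array([3, 2, 4, 1, 0, 3, 2, 4])
llm   = np.array([3, 3, 4, 1, 1, 2, 2, 4])

# Quadratic Weighted Kappa: the human-LLM agreement metric used in the study.
qwk = cohen_kappa_score(human, llm, weights="quadratic")

def cronbach_alpha(ratings):
    # ratings: (n_examinees, n_prompts); intra-rater consistency across prompts.
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

prompts = np.array([[3, 3], [2, 3], [4, 4], [1, 1], [0, 1], [3, 2], [2, 2], [4, 4]])
print(f"QWK={qwk:.3f}, alpha={cronbach_alpha(prompts):.3f}")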


A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation

Xie, Yangxinyu, Jiang, Bowen, Mallick, Tanwi, Bergerson, Joshua David, Hutchison, John K., Verner, Duane R., Branham, Jordan, Alexander, M. Ross, Ross, Robert B., Feng, Yan, Levy, Leslie-Anne, Su, Weijie, Taylor, Camillo J.

arXiv.org Artificial Intelligence

Large language models (LLMs) are a transformational capability at the frontier of artificial intelligence and machine learning that can support decision-makers in addressing pressing societal challenges such as extreme natural hazard events. As generalized models, LLMs often struggle to provide context-specific information, particularly in areas requiring specialized knowledge. In this work, we propose a retrieval-augmented generation (RAG)-based multi-agent LLM system to support analysis and decision-making in the context of natural hazards and extreme weather events. As a proof of concept, we present WildfireGPT, a specialized system focused on wildfire hazards. The architecture employs a user-centered, multi-agent design to deliver tailored risk insights across diverse stakeholder groups. By integrating natural hazard and extreme weather projection data, observational datasets, and scientific literature through a RAG framework, the system ensures both the accuracy and contextual relevance of the information it provides. Evaluation across ten expert-led case studies demonstrates that WildfireGPT significantly outperforms existing LLM-based solutions for decision support.
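
The WildfireGPT pipeline itself is not reproduced here; the following minimal sketch shows only the generic RAG step it builds on: embed a query and a corpus, retrieve the most similar passages, and assemble them into a grounded prompt. The embed function is a deliberately crude stand-in (a real system would call an embedding model), and the corpus snippets are invented placeholders:

import numpy as np

def embed(texts):
    # Stand-in embedding via hashed character trigrams; a real system
    # would call an embedding model here.
    vecs = np.zeros((len(texts), 256))
    for i, t in enumerate(texts):
        for j in range(len(t) - 2):
            vecs[i, hash(t[j:j+3]) % 256] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

corpus = [
    "Projected wildfire frequency for the county rises sharply by 2050.",
    "Historical burn perimeters, 1984-2023, observational dataset.",
    "Defensible-space guidance for residential structures.",
]
query = "How will wildfire risk change near my home by mid-century?"

doc_vecs, q_vec = embed(corpus), embed([query])[0]
top = np.argsort(doc_vecs @ q_vec)[::-1][:2]      # retrieve top-2 passages
context = "\n".join(corpus[i] for i in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)   # this prompt would then go to the generation agent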


Analyzing Examinee Comments using DistilBERT and Machine Learning to Ensure Quality Control in Exam Content

Ye, Ma

arXiv.org Artificial Intelligence

To ensure that items are of sufficient quality to be included in the test, multiple rounds of item review are conducted both before and after the test is administered. Typically, once the testing period has ended, psychometricians will analyze the response data using various methods to identify any items that require further review based on their statistical properties (e.g., p-value, point-biserial correlation, etc.). For example, an item with a low point-biserial correlation can be flagged for further review due to poor discrimination. While flagging items based on their statistics can help identify potentially problematic items, it does not guarantee that the flagged items actually contain issues. Therefore, subject matter experts (SMEs) need to review the flagged items to determine whether they indeed pose any problems.
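
As a concrete illustration of this statistical flagging step, the sketch below computes each item's classical difficulty (p-value) and corrected point-biserial correlation from a hypothetical 0/1 response matrix and flags low-discrimination items for SME review (the data and the 0.2 threshold are illustrative assumptions, not the paper's):

import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical 0/1 response matrix: 6 examinees x 4 items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
])
total = responses.sum(axis=1)

for item in range(responses.shape[1]):
    rest = total - responses[:, item]          # corrected item-total score
    r, _ = pointbiserialr(responses[:, item], rest)
    p_value = responses[:, item].mean()        # classical difficulty (p-value)
    flag = " <- flag for SME review" if r < 0.2 else ""
    print(f"item {item}: p={p_value:.2f}, r_pb={r:+.2f}{flag}")

Item 2, which the stronger examinees tend to miss, comes out with a strongly negative point-biserial and would be flagged.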


MEMHD: Memory-Efficient Multi-Centroid Hyperdimensional Computing for Fully-Utilized In-Memory Computing Architectures

Kang, Do Yeong, Oh, Yeong Hwan, Hwang, Chanwook, Kim, Jinhee, Jeon, Kang Eun, Ko, Jong Hwan

arXiv.org Artificial Intelligence

The implementation of Hyperdimensional Computing (HDC) on In-Memory Computing (IMC) architectures faces significant challenges due to the mismatch between high-dimensional vectors and IMC array sizes, leading to inefficient memory utilization and increased computation cycles. This paper presents MEMHD, a Memory-Efficient Multi-centroid HDC framework designed to address these challenges. MEMHD introduces a clustering-based initialization method and quantization-aware iterative learning for multi-centroid associative memory. Through these approaches and its overall architecture, MEMHD achieves a significant reduction in memory requirements while maintaining or improving classification accuracy. Our approach achieves full utilization of IMC arrays and enables one-shot (or few-shot) associative search. Experimental results demonstrate that MEMHD outperforms state-of-the-art binary HDC models, achieving up to 13.69% higher accuracy with the same memory usage, or 13.25x more memory efficiency at the same accuracy level. Moreover, MEMHD reduces computation cycles by up to 80x and array usage by up to 71x compared to baseline IMC mapping methods when mapped to 128x128 IMC arrays, while significantly improving energy and computation cycle efficiency.
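
MEMHD's full training pipeline is in the paper; the sketch below illustrates only the multi-centroid associative-search idea with random binary hypervectors: each class stores several centroids sized to fit the IMC array, and classification is a single similarity pass over all of them (dimensions, data, and the random centroids are illustrative stand-ins for the learned ones):

import numpy as np

rng = np.random.default_rng(0)
DIM, N_CLASSES, CENTROIDS_PER_CLASS = 128, 4, 4   # 16 rows fill a 128x128 array

def quantize(v):
    return (v > 0).astype(np.int8)                 # binary hypervectors

# Hypothetical multi-centroid associative memory: each class keeps several
# centroids (e.g., from clustering its training encodings) instead of one.
memory = quantize(rng.standard_normal((N_CLASSES, CENTROIDS_PER_CLASS, DIM)))

def classify(query):
    # One-shot associative search: similarity against every centroid at once,
    # which maps to a single pass over a fully-utilized IMC array.
    sims = (memory == query).sum(axis=2)           # Hamming similarity
    return int(np.argmax(sims.max(axis=1)))        # best centroid per class

query = memory[2, 1].copy()
query[:10] ^= 1                                    # flip 10 bits of noise
print(classify(query))                             # -> 2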


Disrupting Test Development with AI Assistants

Joshi, Vijay, Band, Iver

arXiv.org Artificial Intelligence

Recent advancements in large language models, including GPT-4 and its variants, and generative AI-assisted coding tools like GitHub Copilot, ChatGPT, and Tabnine, have significantly transformed software development. This paper analyzes how these innovations impact productivity and software test development metrics. These tools enable developers to generate complete software programs with minimal human intervention before deployment. However, thorough review and testing by developers remain crucial. Utilizing the Test Pyramid concept, which categorizes tests into unit, integration, and end-to-end tests, we evaluate three popular AI coding assistants by generating and comparing unit tests for open-source modules. Our findings show that AI-generated tests are of equivalent quality to the original tests, while highlighting differences in usage and results among the tools. This research advances understanding of the capabilities of AI assistant tools in automated testing.
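
The open-source modules and generated tests evaluated in the paper are not shown here; as a generic illustration of the unit-test tier of the Test Pyramid that such assistants produce, a minimal pytest-style example (slugify and both tests are hypothetical) looks like:

# Illustrative unit-test tier of the Test Pyramid (the function and tests
# are hypothetical, not the open-source modules evaluated in the paper).
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"

def test_slugify_collapses_whitespace():
    assert slugify("  a   b ") == "a-b"

if __name__ == "__main__":
    test_slugify_basic()
    test_slugify_collapses_whitespace()
    print("unit tests passed")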


Uncertainty-preserving deep knowledge tracing with state-space models

Christie, S. Thomas, Cook, Carson, Rafferty, Anna N.

arXiv.org Artificial Intelligence

A central goal of both knowledge tracing and traditional assessment is to quantify student knowledge and skills at a given point in time. Deep knowledge tracing flexibly considers a student's response history but does not quantify epistemic uncertainty, while IRT and CDM compute measurement error but only consider responses to individual tests in isolation from a student's past responses. Elo and BKT could bridge this divide, but the simplicity of the underlying models limits information sharing across skills and imposes strong inductive biases. To overcome these limitations, we introduce Dynamic LENS, a modeling paradigm that combines the flexible uncertainty-preserving properties of variational autoencoders with the principled information integration of Bayesian state-space models. Dynamic LENS allows information from student responses to be collected across time, while treating responses from the same test as exchangeable observations generated by a shared latent state. It represents student knowledge as Gaussian distributions in high-dimensional space and combines estimates both within tests and across time using Bayesian updating. We show that Dynamic LENS has similar predictive performance to competing models, while preserving the epistemic uncertainty - the deep learning analogue to measurement error - that DKT models lack. This approach provides a conceptual bridge across an important divide between models designed for formative practice and summative assessment.
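
The full Dynamic LENS model is high-dimensional and learned, but the Bayesian updating step it relies on can be shown in one dimension: a conjugate Gaussian update that precision-weights the prior knowledge state against a test-level observation, shrinking the posterior variance with each test while keeping it explicit (all numbers below are illustrative):

import numpy as np

def gaussian_update(prior_mean, prior_var, obs_mean, obs_var):
    # Conjugate Gaussian update: precision-weighted average of the prior
    # state and a test-level observation, as in a Kalman filter step.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs_mean / obs_var)
    return post_mean, post_var

# Hypothetical 1-D knowledge state updated across two test occasions.
mean, var = 0.0, 1.0                       # diffuse prior over skill
for obs, noise in [(0.8, 0.5), (1.1, 0.4)]:
    mean, var = gaussian_update(mean, var, obs, noise)
    print(f"mean={mean:.3f}, var={var:.3f}")   # variance is the retained uncertainty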


Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses

Gupta, Gaurav Kumar, Singh, Aditi, Manikandan, Sijo Valayakkad, Ehtesham, Abul

arXiv.org Artificial Intelligence

The recent swift development of LLMs like GPT-4, Gemini, and GPT-3.5 offers a transformative opportunity in medicine and healthcare, especially in digital diagnostics. This study evaluates each model's diagnostic abilities by having it interpret a user's symptoms and determine diagnoses that fit well with common illnesses, and it demonstrates how each of these models could significantly increase diagnostic accuracy and efficiency. Through a series of diagnostic prompts based on symptoms from medical databases, GPT-4 demonstrates higher diagnostic accuracy owing to its deep and extensive training on medical data. Meanwhile, Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model when physicians are making high-risk diagnoses. GPT-3.5, though slightly less advanced, is a good tool for medical diagnostics. This study highlights the need to approach LLMs for healthcare and clinical practice with care and attention, ensuring that any system utilizing LLMs protects patient privacy, complies with health information privacy laws such as HIPAA, and accounts for the social consequences affecting the varied individuals in complex healthcare contexts. This study marks the start of a larger effort to examine how addressing the ethical concerns around LLMs learning from human biases could open up new ways to apply AI in complex medical settings.
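
The study's exact prompts are not reproduced here; a hedged sketch of a symptom-based diagnostic prompt in the same spirit (the template, function name, and wording are assumptions, not the paper's) might look like:

# Hypothetical prompt template in the spirit of the study's symptom-based
# diagnostic prompts; the paper's exact prompts and scoring are not shown here.
def build_diagnostic_prompt(symptoms, history=""):
    return (
        "You are assisting with preliminary triage, not a diagnosis.\n"
        f"Reported symptoms: {', '.join(symptoms)}.\n"
        f"Relevant history: {history or 'none provided'}.\n"
        "List the three most likely common illnesses with brief reasoning, "
        "and state when in-person care is required."
    )

prompt = build_diagnostic_prompt(["fever", "dry cough", "fatigue"], "asthma")
print(prompt)   # sent to GPT-4 / Gemini / GPT-3.5 via each vendor's API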